The metamoRph_label.R script was used to “auto label” all of the conjunctiva, cornea, lens, retina, rpe, and trabecular meshwork samples in the EiaD. Why not optic nerve? Because the others have at least two studies of samples.
The rough approach of metamoRph_label.R is to make make a subsample of the EiaD dataset (tissues listed above, with 10% randomly chosen tissues for each sample type (at least 5)) and:
Then do that 20 times. This “bootstrapping” approach is designed to not robustly not overfit the models.
The labels and scoring are imported here so I can hand assess the ML labels and identify problematic samples to remove.
The assessments are doing via running a PCA on all of the ocular samples (minus the AMD perturbed ones) and looking for how the ML regression scores (a higher minimum_score means the ML model has lower confidence in the call) and predicted labels vary across the dataset. I then hand assess each potential outlier in the context of their full metadata.
library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1 ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
# remotes::install_github("davemcg/metamoRph")
library(metamoRph)
library(data.table)
data.table 1.14.8 using 6 threads (see ?getDTthreads). Latest news: r-datatable.com
Attaching package: ‘data.table’
The following objects are masked from ‘package:lubridate’:
hour, isoweek, mday, minute, month, quarter, second, wday, week, yday, year
The following objects are masked from ‘package:dplyr’:
between, first, last
The following object is masked from ‘package:purrr’:
transpose
# this file generated by scripts/pca_workup_data_prep.R
mat_all <- vroom::vroom("~/git/EiaD_build/counts/gene_counts.csv.gz") %>% data.table() %>%
filter(!grepl("ENST", Gene)) %>%
mutate(Gene = gsub(" \\(.*","",Gene)) #%>%
Rows: 37951 Columns: 2518── Column specification ───────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Gene
dbl (2517): X105785, X105776, X105787, X105786, X105784, X105777, X105782, X105780, X105783, X105779, X...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mat_all <- mat_all[, lapply(.SD, sum, na.rm=TRUE), by='Gene' ]
genes <- mat_all$Gene
mat_all <- mat_all[,-1] %>% as.matrix()
row.names(mat_all) <- genes
mat_all[1:10,1:10]
X105785 X105776 X105787 X105786 X105784 X105777 X105782 X105780 X105783 X105779
A1BG 464.832 384.126 690.901 399.524 460.425 305.097 417.539 437.844 406.081 333.443
A1CF 4.000 0.000 0.000 0.000 0.000 3.000 0.000 1.000 2.000 0.000
A2M 119.000 149.000 170.000 104.000 428.000 277.000 251.000 305.000 615.000 44.000
A2ML1 79.580 63.658 59.888 52.597 63.026 76.072 49.136 58.488 48.945 30.350
A2MP1 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
A3GALT2 0.000 0.000 3.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000
A4GALT 523.070 296.013 575.462 376.594 255.000 228.999 388.259 294.158 336.326 257.298
A4GNT 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
AAAS 2271.998 2064.000 2979.001 2025.001 1719.001 1909.000 2406.000 1994.999 2085.000 2405.999
AACS 1562.418 1603.641 2041.813 1742.944 1207.900 1311.544 1706.899 1360.067 1439.022 1346.006
emeta <- data.table::fread('../data/eyeIntegration22_meta_2023_03_03.csv.gz')
# from script/eigenProjecR_label.R
predictions <- read_tsv('../data/2023_08_12_ML_tissue_predictions.tsv.gz')
Rows: 783 Columns: 7── Column specification ───────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): sample_id, sample_label, predict
dbl (4): miscall_count, mean_max_score, min, max
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pred_values <- read_tsv('../data/2023_08_12_ML_tissue_values.tsv.gz')
Rows: 783 Columns: 10── Column specification ───────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): sample_id, sample_label, predict
dbl (7): mean_max_score, Conjunctiva, Cornea, Lens, Retina, RPE, Trabecular.Meshwork
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tissues <- c("Conjunctiva", "Cornea", "Lens", "Retina", "RPE", "Trabecular Meshwork")
emeta_filter <- emeta %>%
as_tibble() %>%
filter(study_accession != 'SRP012682') %>%
filter(Tissue %in% tissues) %>%
filter(!grepl("AMD", Perturbation)) %>%
select(sample_accession:study_title,Source_details ) %>%
unique()
emeta_filter %>%
mutate(Perturbation = case_when(grepl('MGS', Source_details) ~ Source_details),
Perturbation = gsub("Adult Tissue","", Perturbation)) %>%
mutate(Sub_Tissue = case_when(is.na(Sub_Tissue) ~ '', TRUE ~ Sub_Tissue),
Source = case_when(is.na(Source) ~ '', TRUE ~ Source),
Age = case_when(is.na(Age) ~ '', TRUE ~ Age),
Perturbation = case_when(is.na(Perturbation) ~ '', TRUE ~ Perturbation)) %>%
group_by(Tissue, Source, Sub_Tissue, Age, Perturbation, study_accession) %>%
summarise(`Study Count` = length(unique(study_accession)),
`Sample Count` = n()) %>%
mutate(Tissue = case_when(grepl("Trab",Tissue) ~ 'TM', grepl("EyeLid", Tissue) ~ "Eye Lid", TRUE ~ Tissue),
Source = case_when(grepl("Primary", Source) ~ "P. Culture",
TRUE ~ Source)
) %>%
ggplot(aes(x ='', y=`Sample Count`, fill = study_accession)) +
ggh4x::facet_nested(Tissue+Source+Sub_Tissue+Age ~ ., scale = 'free', space= 'free', switch = 'y',
strip = ggh4x::strip_nested(size = "variable"))+
geom_bar(stat = 'identity') +
xlab('') +
coord_flip() +
theme_minimal() +
theme(strip.text.y.left = element_text(angle = 0, size = 12), text = element_text(size = 16)) +
theme(strip.background = element_rect(colour="gray", fill="gray"),
strip.switch.pad.wrap = margin(t = 0, r = 0, b = 0, l = 0)) +
theme(legend.position = "none") +
scale_fill_manual(values = c(pals::alphabet() , pals::alphabet2(), pals::glasbey()) %>% unname())
`summarise()` has grouped output by 'Tissue', 'Source', 'Sub_Tissue', 'Age', 'Perturbation'. You can override using the `.groups` argument.
Data built from scripts/pca_workup_data_prep.R and run on the non-AMD perturbed ocular bulk RNA-seq samples.
library(metamoRph)
pca_eiad <- run_pca(mat_all[,(emeta_filter %>% pull(sample_accession))], emeta_filter)
Sample CPM scaling
log1p scaling
prcomp start
PCA <- pca_eiad[[1]]
dataGG <- cbind(PCA$x, pca_eiad[[2]])
percentVar <- pca_eiad[[3]]
pcFirst <- 'PC1'
pcSecond <- 'PC2'
rotations <- c(pcFirst, pcSecond)
dataGG %>%
#filter(Source != 'scRNA') %>%
as_tibble() %>%
mutate(Tissue = case_when(Tissue == 'Brain' ~ Tissue,
Cohort == 'Body' ~ 'Body',
TRUE ~ Tissue)) %>%
ggplot(., aes(.data[[pcFirst]], .data[[pcSecond]])) +
geom_point(size=1, aes(color=Tissue, shape = Source)) +
xlab(paste0(pcFirst, ": ",percentVar[str_extract(pcFirst, '\\d+') %>% as.integer()],"% variance")) +
ylab(paste0(pcSecond, ": ",percentVar[str_extract(pcSecond, '\\d+') %>% as.integer()],"% variance")) +
cowplot::theme_cowplot() +
scale_color_manual(values = c(pals::glasbey(), pals::alphabet2(), pals::alphabet2()) %>% unname()) +
scale_fill_manual(values = c(pals::glasbey(), pals::alphabet2(), pals::alphabet2()) %>% unname()) +
scale_shape_manual(values = 0:10)
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
p <- dataGG %>%
left_join(predictions, by = c("sample_accession" = "sample_id")) %>%
as_tibble() %>%
mutate(Tissue = case_when(Tissue == 'Brain' ~ Tissue,
Cohort == 'Body' ~ 'Body',
TRUE ~ Tissue),
Mislabel = case_when(sample_label != predict ~ 'Yes', TRUE ~ 'No'),
Label = paste(Mislabel, sample_accession, Sub_Tissue, Source, Age, sep = '\n')) %>%
ggplot(., aes(.data[[pcFirst]], .data[[pcSecond]], label = Label)) +
geom_point(size=1, aes(color=Tissue, shape = Source)) +
xlab(paste0(pcFirst, ": ",percentVar[str_extract(pcFirst, '\\d+') %>% as.integer()],"% variance")) +
ylab(paste0(pcSecond, ": ",percentVar[str_extract(pcSecond, '\\d+') %>% as.integer()],"% variance")) +
cowplot::theme_cowplot() +
scale_color_manual(values = c(pals::glasbey(), pals::alphabet2(), pals::alphabet2()) %>% unname()) +
scale_fill_manual(values = c(pals::glasbey(), pals::alphabet2(), pals::alphabet2()) %>% unname()) +
scale_shape_manual(values = 0:10)
ggplotly(p)
Where higher values (near 1) mean higher confidence in the label assignment
pcFirst <- 'PC1'
pcSecond <- 'PC2'
dataGG %>%
filter(Tissue %in% tissues) %>%
as_tibble() %>%
mutate(Tissue = case_when(Tissue == 'Brain' ~ Tissue,
Cohort == 'Body' ~ 'Body',
TRUE ~ Tissue)) %>%
left_join(pred_values, by = c('sample_accession' = 'sample_id')) %>%
mutate(mean_max_score = case_when(mean_max_score > 1 ~ 1, TRUE ~ mean_max_score)) %>%
ggplot(., aes(.data[[pcFirst]], .data[[pcSecond]])) +
geom_point(size=3, aes(color=mean_max_score, shape = Source)) +
xlab(paste0(pcFirst, ": ",percentVar[str_extract(pcFirst, '\\d+') %>% as.integer()],"% variance")) +
ylab(paste0(pcSecond, ": ",percentVar[str_extract(pcSecond, '\\d+') %>% as.integer()],"% variance")) +
cowplot::theme_cowplot() +
scale_color_viridis_c() +
scale_shape_manual(values = 0:10) + facet_wrap(~Tissue)
pcFirst <- 'PC3'
pcSecond <- 'PC4'
dataGG %>%
filter(Tissue %in% tissues) %>%
as_tibble() %>%
mutate(Tissue = case_when(Tissue == 'Brain' ~ Tissue,
Cohort == 'Body' ~ 'Body',
TRUE ~ Tissue)) %>%
left_join(pred_values, by = c('sample_accession' = 'sample_id')) %>%
mutate(mean_max_score = case_when(mean_max_score > 1 ~ 1, TRUE ~ mean_max_score)) %>%
ggplot(., aes(.data[[pcFirst]], .data[[pcSecond]])) +
geom_point(size=3, aes(color=mean_max_score, shape = Source)) +
xlab(paste0(pcFirst, ": ",percentVar[str_extract(pcFirst, '\\d+') %>% as.integer()],"% variance")) +
ylab(paste0(pcSecond, ": ",percentVar[str_extract(pcSecond, '\\d+') %>% as.integer()],"% variance")) +
cowplot::theme_cowplot() +
scale_color_viridis_c() +
scale_shape_manual(values = 0:10) + facet_wrap(~Tissue)
Where lower values are more confident in the metamoRph tissue score
“Suspicious” samples are idenfied by three criteria:
sus <- predictions %>% filter(sample_label != predict | mean_max_score < 0.7 | miscall_count > 9)
for (i in c(1,3,5,7,9)){
pcFirst <- paste0('PC', i)
pcSecond <- paste0('PC', i+1)
print(dataGG %>%
filter(Tissue %in% tissues) %>%
as_tibble() %>%
mutate(Tissue = case_when(Tissue == 'Brain' ~ Tissue,
Cohort == 'Body' ~ 'Body',
TRUE ~ Tissue)) %>%
mutate(Suspicious = case_when(sample_accession %in% sus$sample_id ~ 'Yes')) %>%
ggplot(., aes(.data[[pcFirst]], .data[[pcSecond]])) +
geom_point(size=3, aes(color=Suspicious, shape = Source)) +
xlab(paste0(pcFirst, ": ",percentVar[str_extract(pcFirst, '\\d+') %>%
as.integer()],"% variance")) +
ylab(paste0(pcSecond, ": ",percentVar[str_extract(pcSecond, '\\d+') %>%
as.integer()],"% variance")) +
cowplot::theme_cowplot() +
scale_shape_manual(values = 0:10) + facet_wrap(~Tissue)
)}
library(ComplexHeatmap)
Loading required package: grid
========================================
ComplexHeatmap version 2.16.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
If you use it in published research, please cite either one:
- Gu, Z. Complex Heatmap Visualization. iMeta 2022.
- Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
genomic data. Bioinformatics 2016.
The new InteractiveComplexHeatmap package can directly export static
complex heatmaps into an interactive Shiny app with zero effort. Have a try!
This message can be suppressed by:
suppressPackageStartupMessages(library(ComplexHeatmap))
========================================
Attaching package: ‘ComplexHeatmap’
The following object is masked from ‘package:plotly’:
add_heatmap
studies_with_sus <- emeta %>% filter(sample_accession %in% sus$sample_id) %>% pull(study_accession) %>% unique()
full_df <- cbind(predictions, pred_values %>% select(Conjunctiva:`Trabecular.Meshwork`))
for (i in studies_with_sus){
#print(i)
hm_df <- left_join(predictions, pred_values %>% select(sample_id, Conjunctiva:`Trabecular.Meshwork`) %>% unique(), by = 'sample_id') %>%
left_join(emeta %>% select(sample_id = sample_accession, study_accession, Sub_Tissue, Source, Age), by ='sample_id') %>%
filter(study_accession %in% i) %>%
data.frame() %>% unique()
row.names(hm_df) <- hm_df$sample_id
ha_side = rowAnnotation(df = data.frame(ScientistLabel =
hm_df$sample_label %>% factor(levels = unique(full_df$sample_label)),
eigenLabel =
hm_df$predict %>% factor(levels = unique(full_df$sample_label)),
Source =
hm_df$Source %>% factor(levels = unique(hm_df$Source)),
Age =
hm_df$Age %>% factor(levels = unique(hm_df$Age))),
col = list(ScientistLabel = c("Conjunctiva" = pals::alphabet2(20)[2] %>% unname(),
"Cornea" = pals::alphabet2(20)[3] %>% unname(),
"Lens" = pals::alphabet2(20)[6] %>% unname(),
"Retina" = pals::alphabet2(20)[10] %>% unname(),
"RPE" = pals::alphabet2(20)[13] %>% unname(),
"Trabecular Meshwork" = pals::alphabet2(20)[16] %>% unname()),
eigenLabel = c("Conjunctiva" = pals::alphabet2(20)[2] %>% unname(),
"Cornea" = pals::alphabet2(20)[3] %>% unname(),
"Lens" = pals::alphabet2(20)[6] %>% unname(),
"Retina" = pals::alphabet2(20)[10] %>% unname(),
"RPE" = pals::alphabet2(20)[13] %>% unname(),
"Trabecular Meshwork" = pals::alphabet2(20)[16] %>% unname())))
print(Heatmap(as.matrix(hm_df[,8:13]), col=circlize::colorRamp2(c(0, 0.5, 1), c("blue", "white", "red")),
right_annotation = ha_side, cluster_rows = FALSE, column_title = i))
pcFirst <- "PC1"
pcSecond <- 'PC2'
print(dataGG %>%
filter(study_accession == i, Tissue %in% tissues) %>%
as_tibble() %>%
mutate(Tissue = case_when(Tissue == 'Brain' ~ Tissue,
Cohort == 'Body' ~ 'Body',
TRUE ~ Tissue)) %>%
mutate(Suspicious = case_when(sample_accession %in% sus$sample_id ~ 'Yes', TRUE ~ 'No')) %>%
ggplot(., aes(.data[[pcFirst]], .data[[pcSecond]], label = sample_accession)) +
geom_point(data = dataGG %>%
filter(study_accession != i) %>%
filter(Tissue %in% tissues),
size=2,
aes(x =.data[[pcFirst]], y= .data[[pcSecond]]), color = 'grey' ) +
geom_point(size=3, aes(color=Suspicious, shape = Source)) +
ggrepel::geom_label_repel( size = 2.5, max.overlaps = Inf) +
xlab(paste0(pcFirst, ": ",percentVar[str_extract(pcFirst, '\\d+') %>%
as.integer()],"% variance")) +
ylab(paste0(pcSecond, ": ",percentVar[str_extract(pcSecond, '\\d+') %>%
as.integer()],"% variance")) +
cowplot::theme_cowplot() +
scale_shape_manual(values = 0:10) + facet_wrap(~Tissue, ncol = 2)
)
pcFirst <- "PC3"
pcSecond <- 'PC4'
print(dataGG %>%
filter(study_accession == i,
Tissue %in% tissues) %>%
as_tibble() %>%
mutate(Tissue = case_when(Tissue == 'Brain' ~ Tissue,
Cohort == 'Body' ~ 'Body',
TRUE ~ Tissue)) %>%
mutate(Suspicious = case_when(sample_accession %in% sus$sample_id ~ 'Yes', TRUE ~ 'No')) %>%
ggplot(., aes(.data[[pcFirst]], .data[[pcSecond]], label = sample_accession)) +
geom_point(data = dataGG %>%
filter(study_accession != i) %>%
filter(Tissue %in% tissues),
size=2,
aes(x =.data[[pcFirst]], y= .data[[pcSecond]]), color = 'grey' ) +
geom_point(size=3, aes(color=Suspicious, shape = Source)) +
ggrepel::geom_label_repel(size = 2.5, max.overlaps = Inf) +
xlab(paste0(pcFirst, ": ",percentVar[str_extract(pcFirst, '\\d+') %>%
as.integer()],"% variance")) +
ylab(paste0(pcSecond, ": ",percentVar[str_extract(pcSecond, '\\d+') %>%
as.integer()],"% variance")) +
cowplot::theme_cowplot() +
scale_shape_manual(values = 0:10) + facet_wrap(~Tissue, ncol = 2)
)
}
Picked by using the plots above and checking out the full metadata to see if there were any other explanations for what appear to be outliers (an example is SRP105756 which has many early fetal retina tissues or SRP097696 which has a surprising flow sorted retina which only keeps endothelial cells).
eigen_hand_removed <- c('SRS6509684',
'SRS390609',
'SRS542573','SRS542574',
'SRS1478834', #
'SRS8145606', #
'SRS360123','SRS360124', #
'SRS7504868','SRS7504869', #
'SRS523835','SRS523813',
'SRS1955494')
predictions %>% left_join(emeta %>% select(study_accession, sample_accession, Sub_Tissue, Source, Age), by = c('sample_id'='sample_accession')) %>% as_tibble() %>% filter(sample_id %in% eigen_hand_removed) %>% arrange(sample_label)
predictions %>%
left_join(emeta %>% ungroup() %>% dplyr::select(study_accession, sample_accession, Sub_Tissue, Source, Age),
by = c('sample_id'='sample_accession')) %>%
as_tibble() %>%
filter(sample_id %in% eigen_hand_removed) %>%
arrange(sample_label) %>% group_by(sample_label, Sub_Tissue, Source, Age) %>% summarise(Count = n(), samples = paste0(sample_id, collapse = ', '))
`summarise()` has grouped output by 'sample_label', 'Sub_Tissue', 'Source'. You can override using the `.groups` argument.
for (i in c(1,3,5,7)){
pcFirst <- paste0('PC', i)
pcSecond <- paste0('PC', i+1)
print(dataGG %>%
as_tibble() %>%
mutate(Tissue = case_when(Tissue == 'Brain' ~ Tissue,
Cohort == 'Body' ~ 'Body',
TRUE ~ Tissue)) %>%
mutate(Removed = case_when(sample_accession %in% eigen_hand_removed ~ 'Yes')) %>%
ggplot(., aes(.data[[pcFirst]], .data[[pcSecond]])) +
geom_point(size=1, aes(color=Removed, shape = Source)) +
xlab(paste0(pcFirst, ": ",percentVar[str_extract(pcFirst, '\\d+') %>% as.integer()],"% variance")) +
ylab(paste0(pcSecond, ": ",percentVar[str_extract(pcSecond, '\\d+') %>% as.integer()],"% variance")) +
cowplot::theme_cowplot() +
scale_color_manual(values = c(pals::glasbey(), pals::alphabet2(), pals::alphabet2()) %>% unname()) +
scale_fill_manual(values = c(pals::glasbey(), pals::alphabet2(), pals::alphabet2()) %>% unname()) +
scale_shape_manual(values = 0:10) +
facet_wrap(~Tissue)
)
}
write(eigen_hand_removed, file = '../data/excluded_samples.txt')
devtools::session_info()